Objective: to learn how to generate your own functions and perform parallel computations.
R allows the user to create functions. The mode of these expressions is function, are stored in a special form, and might be used in further expressions.
A function is defined as follows:
function_name is the name of the function defined by the user. We do not recommend to use existing function names.
argn are the different arguments of the function.
expression is the process that the function will follow when is applied.
The function is then applied as follows:
As an example, let’s generate a function that counts the characteres in a string. First we will start with generating the skeleton og the function and adding a parameter named word:
Then, we have to apply the nchar function.
Finally, we can include a message with the result. Note that we will use the return function to explicitly indicate the output of our function.
Now let’s apply the function:
count_char(word = "Oscar Baez")
## [1] "The word Oscar Baez has 10 characters!"
count_char(word = "Spatial analysis")
## [1] "The word Spatial analysis has 16 characters!"
count_char(word = 10:14)
## [1] "The word 10 has 2 characters!" "The word 11 has 2 characters!"
## [3] "The word 12 has 2 characters!" "The word 13 has 2 characters!"
## [5] "The word 14 has 2 characters!"In the last example, we provided a numeric vector to the word parameter. We could control the input of the parameter. For example:
count_char <- function(word){
if(!is.character(word))
stop("The input is not a character!")
n <- nchar(word)
n <- paste("The word", word, "has", n, "characters!")
return(n)
}The stop function will exit the execution of the function if the condition is not met.
How would you create a function that gives you the maximum value in a given vector?
If we apply it:
Now that we have created our own function, how can we apply it to a raster stack to get the maximum value per grid-cell over a defined period? For this purpose, we can use the app function from the terra package.
For this purpose, we will use monthly CHIRPSv2 data for 2000.
print(chirps)
## class : SpatRaster
## dimensions : 2000, 7200, 12 (nrow, ncol, nlyr)
## resolution : 0.05, 0.05 (x, y)
## extent : -180, 180, -50, 50 (xmin, xmax, ymin, ymax)
## coord. ref. : lon/lat WGS 84 (EPSG:4326)
## sources : chirps_monthly2000-01.tif
## chirps_monthly2000-02.tif
## chirps_monthly2000-03.tif
## ... and 9 more source(s)
## names : chirp~00-01, chirp~00-02, chirp~00-03, chirp~00-04, chirp~00-05, chirp~00-06, ...
## min values : 0, 0, 0, 0, 0, 0, ...
## max values : 1251.456, 1467.558, 1222.492, 1224.301, 1322.237, 1584.747, ...Now, we can apply the app function to obtain the maximum value over each grid-cell.
We will compute the mean annual precipitation using 12 monthly files of CHIRPSv2 over the Nile Basin countries in this example:
print(countries)
## class : SpatVector
## geometry : polygons
## dimensions : 15, 9 (geometries, attributes)
## extent : 12.19545, 47.98618, -13.45568, 31.66734 (xmin, xmax, ymin, ymax)
## source : Nile_Basin_Countries_GAUL2014_2.shp
## coord. ref. : lon/lat WGS 84 (EPSG:4326)
## names : STATUS DISP_AREA ADM0_CODE ADM0_NAME STR0_YEAR EXP0_YEAR
## type : <chr> <chr> <int> <chr> <int> <int>
## values : Member State NO 205 Rwanda 1000 3000
## Member State NO 133 Kenya 1000 3000
## Member State NO 74 South Sudan 2011 3000
## Shape_Leng Shape_Area Name_label
## <num> <num> <chr>
## 8.127 2.064 Rwanda
## 48.55 47.35 Kenya
## 46.91 51.6 South SudanTherefore, we want to create a function that:
Once that the function is ready, it can be applied to the different features of the shapefile.
First, we can prepare the data that will be used as input in the function. We will save the raster paths in an object called raster_paths and the shapefile in an object named shape.
print(shape)
## class : SpatVector
## geometry : polygons
## dimensions : 1, 9 (geometries, attributes)
## extent : 28.86187, 30.89965, -2.839881, -1.047913 (xmin, xmax, ymin, ymax)
## coord. ref. : lon/lat WGS 84 (EPSG:4326)
## names : STATUS DISP_AREA ADM0_CODE ADM0_NAME STR0_YEAR EXP0_YEAR
## type : <chr> <chr> <int> <chr> <int> <int>
## values : Member State NO 205 Rwanda 1000 3000
## Shape_Leng Shape_Area Name_label
## <num> <num> <chr>
## 8.127 2.064 RwandaThen, we can create the function. We will call it get_map for mean annual precipitation.
This function, requires two inputs, the 12 paths of the global precipitation rasters (one for each month o the year), and the shapefile. We can give the paths of the objects and read them within the function.
At the end, we will assign these objects to the function i.e., raster_paths = raster_paths and shape = shape. The first step would be to import the raster data using the rast function from the terra package.
The second step would be to crop and mask the raster stack to the extent of the shapefile.
Finally, we can sum the 12 precipitation layers and return the result.
Once we have the function ready, we can apply it:
Processing large amounts of spatial data can be very time consuming; however, most of the R code runs on a single processor. Most of the time this is not a problem but sometime, these computations can be:
Time consuming
Memory consuming
Great time investment in reading or writing files
Great time in transferring data
Traditional computers have a single CPU, that in turn can contain multiple cores. These processors and cores are able to perform computations (note that modern computers can have multiple processors).
The following function generates a list of prime numbers up to a designed number. This function was taken from stackoverflow.
prime_numbers <- function(n){
n <- as.integer(n)
if(n > 1e6) stop("n too large")
primes <- rep(TRUE, n)
primes[1] <- FALSE
last.prime <- 2L
for(i in last.prime:floor(sqrt(n)))
{
primes[seq.int(2L*last.prime, n, last.prime)] <- FALSE
last.prime <- last.prime + min(which(primes[(last.prime+1):n]))
}
which(primes)
}
prime_numbers(100)
## [1] 2 3 5 7 11 13 17 19 23 29 31 37 41 43 47 53 59 61 67 71 73 79 83 89 97We will use the function Sys.time to assess the time needed by this function to retrieve prime numbers for a sequence starting in 10 and ending in 15,000.
To parallelise the code we will use the doParallel package. This package provides functions for parallel execution of R code on machines with multiple cores or processors or multiple computers.
We have to indicate the number of cores to be used. For this purpose, we can use the detectCores function.
It is not recommended to use all cores so your computer don’t freeze; therefore:
Afterwards, we have to register the parallel backend. For this purpose, we will use the registerDoParallel function.
Finally, the for iterative process will be different. Instead of for, we will use foreach; instead of in we will use =. Afterwards, we can explicitly indicate which packages will be used with .packages(we will go through this in the example). Finally, the operator %dopar% should be placed before opening the foreach.
The function stopImplicitCluster can be used in vignettes and other places where it is important to explicitly close the implicitly created cluster.
Coming back to the example of extracting mean annual precipitation by country, let’s obtain the mean annual precipitation for all Nile Basin countries.
Checking at the data, we can see that there are disputed areas in the shapefile. As we want only the countries we can exclude them.
countries <- countries[which(countries$DISP_AREA == "NO"),]
data.frame(countries)
## STATUS DISP_AREA ADM0_CODE ADM0_NAME STR0_YEAR
## 1 Member State NO 205 Rwanda 1000
## 2 Member State NO 133 Kenya 1000
## 3 Member State NO 74 South Sudan 2011
## 4 Member State NO 257 United Republic of Tanzania 1000
## 5 Member State NO 253 Uganda 1000
## 6 Member State NO 79 Ethiopia 1000
## 7 Member State NO 77 Eritrea 1000
## 8 Member State NO 43 Burundi 1000
## 9 Member State NO 68 Democratic Republic of the Congo 1000
## 10 Member State NO 6 Sudan 2011
## 11 Member State NO 40765 Egypt 1000
## EXP0_YEAR Shape_Leng Shape_Area Name_label
## 1 3000 8.127003 2.063852 Rwanda
## 2 3000 48.549430 47.345770 Kenya
## 3 3000 46.905431 51.599166 South Sudan
## 4 3000 82.950601 76.860266 United Republic of Tanzania
## 5 3000 23.244416 19.629674 Uganda
## 6 3000 50.380131 92.869258 Ethiopia
## 7 3000 50.005783 10.167958 Eritrea
## 8 3000 9.118459 2.185652 Burundi
## 9 3000 98.951177 189.998107 Democratic Republic of the Congo
## 10 3000 81.910242 155.888802 Sudan
## 11 3000 61.251157 89.079113 EgyptNow, we can apply the function get_map for every country. As this process surely will be time consuming we can paralelise it. First, here is the get_map function.
First we store the raster file paths into an object as we did before. Then we can set the output path where we want to store the resuting files.
Then we construct the foreach loop as follows:
Set the number of cores with detectCores
Register the cluster with registerDoParallel
Start the foreach iterative process
Indicate the packages that will be used. In this case terra
Include the parallel operator %dopar%
Develop the process and export the raster data
Close the loop
Close the parallel cluster if needed with stopImplicitCluster
cores <- detectCores() - 1
registerDoParallel(cores = cores)
foreach(i = 1:nrow(countries), .packages = "terra") %dopar% {
r <- get_map(raster_paths = raster_paths, shape = countries[i,])
n <- countries[i,]$Name_label
n <- paste0(output, "/", n, ".tif")
writeRaster(r, n, overwrite = TRUE)
}
stopImplicitCluster()After the process, we should have the rasters in the output folder.